NBA PCA Analysis

Looking at similarities between NBA players from the 2015-2016 season

Roupen Khanjian
01-25-2021
library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN v1.3.0
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data, CRAN v2.1.0
library(here) # A Simpler Way to Find Your Files, CRAN v1.0.1
library(scales) # Scale Functions for Visualization, CRAN v1.1.1
library(ggfortify) # Data Visualization Tools for Statistical Analysis Results, CRAN v0.4.11
library(gghighlight) # Highlight Lines and Points in 'ggplot2', CRAN v0.3.1
library(plotly) # Create Interactive Web Graphics via 'plotly.js', CRAN v4.9.3
library(gt) # Easily Create Presentation-Ready Display Tables, CRAN v0.2.2 

Brief Introduction to Data

The data used for this task was obtained from the following link: data. I decided to analyze National Basketball Association (NBA) player statistics from the 2015-2016 season. Each observation in this dataset is one player’s per-game statistics. I chose PCA to see how the players differ across 11 features that are commonly considered important to a basketball player’s success.

Data Wrangling and PCA

nba_players <- read_csv(here("_texts", 
                             "NBA_PCA",
                             "data", "nba_players.csv")) %>% 
  clean_names() %>% 
  separate(player, into = c("player", "html"), sep = "\\\\") %>% # clean the player name column
  dplyr::filter(mp > 18) %>% # filter for players who played over 18 minutes a game (out of a possible 48)
  dplyr::filter(g > 30) %>% # filter for players who played over 30 games (out of a possible 82)
  drop_na(age, fga, e_fg_percent, ft_percent, trb:pts)  # drop observations with missing values 

nba_players_pca <-  nba_players %>%  
  dplyr::select(age, fga, e_fg_percent, ft_percent, trb:pts) %>% # select the features for pca
  scale() %>% # scale the features
  prcomp() # run pca

# Quick look at the data
nba_players %>%
  dplyr::select(player, pos, age, fga, e_fg_percent, ft_percent, trb:pts) %>% 
  filter(player %in% sample(player, size = 5)) %>% 
  gt() %>% 
    tab_header(
      title = "Statistics from a Random Sample of Five Players",
      subtitle = "From the 2015-2016 NBA regular season"
    ) %>% 
    fmt_percent(
      columns = vars(e_fg_percent, ft_percent),
      decimals = 1
    ) %>% 
  tab_style(
    style = list(
      cell_text(style = "italic"),
      cell_borders(
        side = c("right"), 
        color = "black",
        weight = px(2)
        )
    ),
    locations = cells_body(
      columns = 1
    ))  %>% 
  cols_label(
    pos = "position"
  ) 
Statistics from a Random Sample of Five Players
From the 2015-2016 NBA regular season

player                   position  age  fga   e_fg_percent  ft_percent  trb  ast  stl  blk  tov  pf   pts
Isaiah Canaan            SG        24   9.4   48.2%         83.3%       2.3  1.8  0.7  0.2  1.2  1.7  11.0
Michael Carter-Williams  PG        24   10.3  46.6%         65.4%       5.1  5.2  1.5  0.8  2.8  3.0  11.5
Monta Ellis              SG        30   12.6  47.0%         78.6%       3.3  4.7  1.9  0.5  2.5  2.1  13.8
Brandon Jennings         PG        26   6.3   45.2%         73.1%       2.0  3.5  0.6  0.1  1.2  1.2  6.9
Dirk Nowitzki            PF        37   14.8  50.4%         89.3%       6.5  1.8  0.7  0.7  1.1  2.1  18.3
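
A biplot shows only the first two principal components, so it can be useful to check how much of the total variance those two components retain. Below is a minimal sketch on synthetic data: the `toy` data frame and its values are made up and only stand in for the player statistics. The same calls should apply to `nba_players_pca`, and `summary(nba_players_pca)` reports the same proportions.

```r
# Synthetic stand-in for the per-game statistics (values are made up)
set.seed(1)
n <- 50
pts <- rnorm(n, mean = 12, sd = 5)
toy <- data.frame(
  pts = pts,
  fga = pts * 0.8 + rnorm(n, sd = 1), # built to correlate with scoring
  trb = rnorm(n, mean = 5, sd = 2),
  ast = rnorm(n, mean = 3, sd = 1.5)
)

# Same recipe as the wrangling code: standardize, then run PCA
toy_pca <- prcomp(scale(toy))

# Proportion of variance explained by each component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_explained, 3)

# Cumulative share; the second entry is what a 2-D biplot retains
cumsum(var_explained)
```

If the cumulative share for the first two components is low, points that look close in the biplot may still differ along components that are not plotted.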

Biplot

autoplot(nba_players_pca,
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "pos" # color the points by position
         ) +
  labs(title = "Biplot for PCA",
       caption = paste0("Biplot of NBA players' basic statistics ",
                        "from the 2015-2016 NBA season.\n",
                        "Colors are organized by position."),
       colour = "Position") +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = 13)
        )
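
The arrows in a biplot are drawn from the variable loadings, which `prcomp()` stores in the `rotation` element (the plotting function may rescale them for display). The sketch below inspects loadings numerically on synthetic data. The `toy` data frame and its values are hypothetical; with the real analysis the same lines would read from `nba_players_pca$rotation`.

```r
# Synthetic stand-in data (made-up values, hypothetical columns)
set.seed(2)
n <- 40
toy <- data.frame(
  pts = rnorm(n, mean = 12, sd = 5),
  trb = rnorm(n, mean = 5, sd = 2),
  ast = rnorm(n, mean = 3, sd = 1.5),
  blk = rnorm(n, mean = 0.5, sd = 0.3)
)
toy_pca <- prcomp(scale(toy))

# Loadings: one row per feature, one column per principal component
loadings <- toy_pca$rotation
round(loadings[, 1:2], 2)

# Each column is a unit-length vector, so its squared entries sum to 1
colSums(loadings^2)

# Features ranked by the absolute size of their PC1 loading
sort(abs(loadings[, "PC1"]), decreasing = TRUE)
```

Features whose loadings on PC1 and PC2 have the same sign and similar size point in similar directions in the biplot, which suggests they are positively correlated in the data.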

A few observations from the above biplot:

Biplot Highlighting a Few Players

Below is the same biplot, this time highlighting the five players who finished highest in that season's MVP voting (the full results can be found here: MVP voting).

autoplot(nba_players_pca,
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "player"
         ) +
  labs(title = "Biplot for PCA",
       subtitle = "Top 5 players in MVP Voting are Highlighted",
       caption = "Biplot highlighting some of the best players for the 2015-2016 NBA season") +
  gghighlight(player %in% c("Kawhi Leonard", "Stephen Curry", "LeBron James",
                            "Russell Westbrook", "Kevin Durant")) + # top 5 players in MVP voting
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = 13),
        plot.subtitle = element_text(size = 11)
        )

Biplot Using plotly to see Similarities Between Players

Lastly, to see which players are similar to one another, I made an interactive plot where you can hover over each data point to reveal the player's name.

nba_pca_plot <- autoplot(nba_players_pca,
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "player", # map colour to player so the name is available to the tooltip
         colour.show.legend = FALSE
         ) +
  labs(title = "Interactive Biplot") +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        legend.position="none",
        plot.title = element_text(face = "bold", size = 13)
        )

ggplotly(nba_pca_plot, tooltip = "player") # interactive plot
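
Hovering works well for exploration, but similarity can also be put into numbers. One possible approach, sketched here on synthetic data, is to measure Euclidean distance between players in the space of the first two principal components. The player names, the values, and the `nearest_players()` helper are all hypothetical. The sketch uses the raw scores in `x`, which may be scaled differently from the coordinates shown in the plot.

```r
# Synthetic stand-in data; player names and values are made up
set.seed(3)
n <- 30
toy <- data.frame(
  pts = rnorm(n, mean = 12, sd = 5),
  trb = rnorm(n, mean = 5, sd = 2),
  ast = rnorm(n, mean = 3, sd = 1.5)
)
rownames(toy) <- paste0("player_", seq_len(n))

toy_pca <- prcomp(scale(toy))

# Scores on the first two components, one row per player
scores <- toy_pca$x[, 1:2]

# Pairwise Euclidean distances between players in PC1-PC2 space
d <- as.matrix(dist(scores))

# Hypothetical helper: the k players closest to a given player
nearest_players <- function(d, player, k = 3) {
  others <- d[player, colnames(d) != player]
  head(sort(others), k)
}

nearest_players(d, "player_1")
```

Using more than two components in `scores` would account for differences that the 2-D plot hides, at the cost of no longer matching what is visible on screen.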